{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Training Hybrid Recommendation Model with the MovieLens Dataset"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Learning objectives**\n",
    "1. Extract user and product factors from a BigQuery Matrix Factorizarion Model.\n",
    "2. Format inputs for a BigQuery Hybrid Recommendation Model.\n",
    "\n",
    "## Introduction\n",
    "\n",
    "The matrix factorization approach does not use any information about users or movies beyond what is available from the ratings matrix. However, we will often have user information (such as the city they live, their annual income, their annual expenditure, etc.) and we will almost always have more information about the products in our catalog. How do we incorporate this information in our recommendation model?\n",
    "\n",
    "Each learning objective will correspond to a __#TODO__ in the [student lab notebook](../labs/als_bqml_hybrid.ipynb) -- try to complete that notebook first before reviewing this solution notebook. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import tensorflow as tf\n",
    "PROJECT = \"your-project-id-here\" # REPLACE WITH YOUR PROJECT ID\n",
    "\n",
    "# Do not change these\n",
    "os.environ[\"PROJECT\"] = PROJECT\n",
    "os.environ[\"TFVERSION\"] = '2.6'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import the dataset and trained model\n",
    "In the previous notebook, you imported 20 million movie recommendations and trained an ALS model with BigQuery ML\n",
    "\n",
    "To save you the steps of having to do so again (if this is a new environment) you can run the below commands to copy over the clean data and trained model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First create the BigQuery dataset and copy over the data. If you get **already exists** in the output, please move forward in the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "mkdir bqml_data\n",
    "cd bqml_data\n",
    "curl -O 'http://files.grouplens.org/datasets/movielens/ml-20m.zip'\n",
    "unzip ml-20m.zip\n",
    "yes | bq rm -r $PROJECT:movielens\n",
    "bq --location=US mk --dataset \\\n",
    "    --description 'Movie Recommendations' \\\n",
    "    $PROJECT:movielens\n",
    "bq --location=US load --source_format=CSV \\\n",
    "    --autodetect movielens.ratings gs://cloud-training/recommender-systems/movielens/ratings.csv\n",
    "bq --location=US load  --source_format=CSV \\\n",
    "    --autodetect movielens.movies_raw gs://cloud-training/recommender-systems/movielens/movies.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, copy over the trained recommendation model. Note that if you're project is in the EU you will need to change the location from US to EU below. Note that as of the time of writing you cannot copy models across regions with `bq cp`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Table 'cloud-training-demos:movielens.recommender_16' successfully copied to 'qwiklabs-gcp-00-20dab82189fb:movielens.recommender_16'\n",
      "Table 'cloud-training-demos:movielens.recommender_hybrid' successfully copied to 'qwiklabs-gcp-00-20dab82189fb:movielens.recommender_hybrid'\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Waiting on bqjob_r73d94064122a5447_00000175212ad0ee_1 ... (0s) Current status: DONE   \n",
      "Waiting on bqjob_r754ab701f3646228_00000175212adb6a_1 ... (0s) Current status: DONE   \n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "bq --location=US cp \\\n",
    "cloud-training-demos:movielens.recommender_16 \\\n",
    "movielens.recommender_16\n",
    "\n",
    "bq --location=US cp \\\n",
    "cloud-training-demos:movielens.recommender_hybrid \\\n",
    "movielens.recommender_hybrid"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On examining the first few movies using the query following query, we can see that the genres column is a formatted string:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "SELECT *\n",
    "FROM movielens.movies_raw\n",
    "WHERE movieId < 5"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can parse the genres into an array and rewrite the table as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "CREATE OR REPLACE TABLE movielens.movies AS\n",
    "    SELECT * REPLACE(SPLIT(genres, \"|\") AS genres)\n",
    "    FROM movielens.movies_raw"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "SELECT *\n",
    "FROM movielens.movies\n",
    "WHERE movieId < 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, ensure the model still works by invoking predictions for movie recommendations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>predicted_rating</th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "      <th>userId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>11.496418</td>\n",
       "      <td>51632</td>\n",
       "      <td>Atlantis: Milo's Return (2003)</td>\n",
       "      <td>903</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>11.431917</td>\n",
       "      <td>127390</td>\n",
       "      <td>Family Guy Presents: Blue Harvest (2007)</td>\n",
       "      <td>903</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>11.428360</td>\n",
       "      <td>89939</td>\n",
       "      <td>Gigi (1949)</td>\n",
       "      <td>903</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>11.372779</td>\n",
       "      <td>83359</td>\n",
       "      <td>Play House, The (1921)</td>\n",
       "      <td>903</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>9.750968</td>\n",
       "      <td>97662</td>\n",
       "      <td>Girl in Progress (2012)</td>\n",
       "      <td>903</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   predicted_rating  movieId                                     title  userId\n",
       "0         11.496418    51632           Atlantis: Milo's Return (2003)      903\n",
       "1         11.431917   127390  Family Guy Presents: Blue Harvest (2007)     903\n",
       "2         11.428360    89939                               Gigi (1949)     903\n",
       "3         11.372779    83359                    Play House, The (1921)     903\n",
       "4          9.750968    97662                   Girl in Progress (2012)     903"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "SELECT * FROM\n",
    "ML.PREDICT(MODEL `movielens.recommender_16`, (\n",
    "  SELECT \n",
    "    movieId, title, 903 AS userId\n",
    "  FROM movielens.movies, UNNEST(genres) g\n",
    "  WHERE g = 'Comedy'\n",
    "))\n",
    "ORDER BY predicted_rating DESC\n",
    "LIMIT 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Incorporating user and movie information \n",
    "The matrix factorization approach does not use any information about users or movies beyond what is available from the ratings matrix. However, we will often have user information (such as the city they live, their annual income, their annual expenditure, etc.) and we will almost always have more information about the products in our catalog. How do we incorporate this information in our recommendation model?\n",
    "\n",
    "The answer lies in recognizing that the user factors and product factors that result from the matrix factorization approach end up being a concise representation of the information about users and products available from the ratings matrix. We can concatenate this information with other information we have available and train a regression model to predict the rating.\n",
    "### Obtaining user and product factors\n",
    "We can get the user factors or product factors from ML.WEIGHTS. For example to get the product factors for movieId=96481 and user factors for userId=54192, we would do:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>processed_input</th>\n",
       "      <th>feature</th>\n",
       "      <th>factor_weights</th>\n",
       "      <th>intercept</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>movieId</td>\n",
       "      <td>96481</td>\n",
       "      <td>[{\"factor\":16,\"weight\":4.7488055154865094e-16}...</td>\n",
       "      <td>-1.398794</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>userId</td>\n",
       "      <td>54192</td>\n",
       "      <td>[{\"factor\":16,\"weight\":1.451774918706243},{\"fa...</td>\n",
       "      <td>1.839885</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  processed_input feature                                     factor_weights  \\\n",
       "0         movieId   96481  [{\"factor\":16,\"weight\":4.7488055154865094e-16}...   \n",
       "1          userId   54192  [{\"factor\":16,\"weight\":1.451774918706243},{\"fa...   \n",
       "\n",
       "   intercept  \n",
       "0  -1.398794  \n",
       "1   1.839885  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "SELECT \n",
    "    processed_input,\n",
    "    feature,\n",
    "    TO_JSON_STRING(factor_weights) AS factor_weights,\n",
    "    intercept\n",
    "FROM ML.WEIGHTS(MODEL `movielens.recommender_16`)\n",
    "WHERE\n",
    "    (processed_input = 'movieId' AND feature = '96481')\n",
    "    OR (processed_input = 'userId' AND feature = '54192')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Multiplying these weights and adding the intercept is how we get the predicted rating for this combination of movieId and userId in the matrix factorization approach.\n",
    "\n",
    "These weights also serve as a low-dimensional representation of the movie and user behavior. We can create a regression model to predict the rating given the user factors, product factors, and any other information we know about our users and products.\n",
    "### Creating input features\n",
    "The MovieLens dataset does not have any user information, and has very little information about the movies themselves. To illustrate the concept, therefore, let’s create some synthetic information about users:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "CREATE OR REPLACE TABLE movielens.users AS\n",
    "SELECT\n",
    "    userId,\n",
    "    RAND() * COUNT(rating) AS loyalty,\n",
    "    CONCAT(SUBSTR(CAST(userId AS STRING), 0, 2)) AS postcode\n",
    "FROM\n",
    "  movielens.ratings\n",
    "GROUP BY userId"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Input features about users can be obtained by joining the user table with the ML weights and selecting all the user information and the user factors from the weights array.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>loyalty</th>\n",
       "      <th>postcode</th>\n",
       "      <th>user_factors</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>65536</td>\n",
       "      <td>12.023909</td>\n",
       "      <td>65</td>\n",
       "      <td>[5.2869248538055835, 0.5717508872411037, -8.53...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>131072</td>\n",
       "      <td>18.668811</td>\n",
       "      <td>13</td>\n",
       "      <td>[9.205648021521588, 4.480600489131113, -6.9878...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>256</td>\n",
       "      <td>5.854007</td>\n",
       "      <td>25</td>\n",
       "      <td>[-6.81069854746253, 2.445484088458461, -0.1672...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>65792</td>\n",
       "      <td>37.295750</td>\n",
       "      <td>65</td>\n",
       "      <td>[-2.4468382816400025, 4.920656619448598, -5.54...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>131328</td>\n",
       "      <td>28.484759</td>\n",
       "      <td>13</td>\n",
       "      <td>[-6.25690712503519, -3.8063811222966732, -1.11...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   userId    loyalty postcode  \\\n",
       "0   65536  12.023909       65   \n",
       "1  131072  18.668811       13   \n",
       "2     256   5.854007       25   \n",
       "3   65792  37.295750       65   \n",
       "4  131328  28.484759       13   \n",
       "\n",
       "                                        user_factors  \n",
       "0  [5.2869248538055835, 0.5717508872411037, -8.53...  \n",
       "1  [9.205648021521588, 4.480600489131113, -6.9878...  \n",
       "2  [-6.81069854746253, 2.445484088458461, -0.1672...  \n",
       "3  [-2.4468382816400025, 4.920656619448598, -5.54...  \n",
       "4  [-6.25690712503519, -3.8063811222966732, -1.11...  "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "WITH userFeatures AS (\n",
    "  SELECT \n",
    "     u.*,\n",
    "     (SELECT ARRAY_AGG(weight) FROM UNNEST(factor_weights)) AS user_factors\n",
    "  FROM movielens.users u\n",
    "  JOIN ML.WEIGHTS(MODEL movielens.recommender_16) w\n",
    "  ON processed_input = 'userId' AND feature = CAST(u.userId AS STRING)\n",
    ")\n",
    "\n",
    "SELECT * FROM userFeatures\n",
    "LIMIT 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, we can get product features for the movies data, except that we have to decide how to handle the genre since a movie could have more than one genre. If we decide to create a separate training row for each genre, then we can construct the product features using."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "      <th>g</th>\n",
       "      <th>product_factors</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>6441</td>\n",
       "      <td>Battle of Britain (1969)</td>\n",
       "      <td>War</td>\n",
       "      <td>[-0.05378647448089614, 0.12521521613203146, 0....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>117750</td>\n",
       "      <td>The Real Glory (1939)</td>\n",
       "      <td>War</td>\n",
       "      <td>[4.996003610813204e-16, -7.112366251504901e-16...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>100574</td>\n",
       "      <td>Raid on Rommel (1971)</td>\n",
       "      <td>War</td>\n",
       "      <td>[-0.04352516797793908, 0.028588098837068587, -...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>1450</td>\n",
       "      <td>Prisoner of the Mountains (Kavkazsky plennik) ...</td>\n",
       "      <td>War</td>\n",
       "      <td>[-0.0068831575882775786, -0.03107592153404104,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>118296</td>\n",
       "      <td>The Diary of an Unknown Soldier (1959)</td>\n",
       "      <td>War</td>\n",
       "      <td>[-2.4286128663675334e-16, 4.163336342344337e-1...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   movieId                                              title    g  \\\n",
       "0     6441                           Battle of Britain (1969)  War   \n",
       "1   117750                              The Real Glory (1939)  War   \n",
       "2   100574                              Raid on Rommel (1971)  War   \n",
       "3     1450  Prisoner of the Mountains (Kavkazsky plennik) ...  War   \n",
       "4   118296             The Diary of an Unknown Soldier (1959)  War   \n",
       "\n",
       "                                     product_factors  \n",
       "0  [-0.05378647448089614, 0.12521521613203146, 0....  \n",
       "1  [4.996003610813204e-16, -7.112366251504901e-16...  \n",
       "2  [-0.04352516797793908, 0.028588098837068587, -...  \n",
       "3  [-0.0068831575882775786, -0.03107592153404104,...  \n",
       "4  [-2.4286128663675334e-16, 4.163336342344337e-1...  "
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "WITH productFeatures AS (\n",
    "  SELECT \n",
    "      p.* EXCEPT(genres),\n",
    "      g, (SELECT ARRAY_AGG(weight) FROM UNNEST(factor_weights))\n",
    "            AS product_factors\n",
    "  FROM movielens.movies p, UNNEST(genres) g\n",
    "  JOIN ML.WEIGHTS(MODEL movielens.recommender_16) w\n",
    "  ON processed_input = 'movieId' AND feature = CAST(p.movieId AS STRING)\n",
    ")\n",
    "\n",
    "SELECT * FROM productFeatures\n",
    "LIMIT 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Combining these two WITH clauses and pulling in the rating corresponding the movieId-userId combination (if it exists in the ratings table), we can create the training dataset.\n",
    "\n",
    "**TODO 1**: Combine the above two queries to get the user factors and product factor for each rating."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**NOTE**: The below cell will take approximately 4~5 minutes for the completion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "CREATE OR REPLACE TABLE movielens.hybrid_dataset AS\n",
    "\n",
    "    WITH userFeatures AS (\n",
    "      SELECT \n",
    "         u.*,\n",
    "         (SELECT ARRAY_AGG(weight) FROM UNNEST(factor_weights))\n",
    "            AS user_factors\n",
    "      FROM movielens.users u\n",
    "      JOIN ML.WEIGHTS(MODEL movielens.recommender_16) w\n",
    "      ON processed_input = 'userId' AND feature = CAST(u.userId AS STRING)\n",
    "    ),\n",
    "\n",
    "    productFeatures AS (\n",
    "      SELECT \n",
    "          p.* EXCEPT(genres),\n",
    "          g, (SELECT ARRAY_AGG(weight) FROM UNNEST(factor_weights))\n",
    "            AS product_factors\n",
    "      FROM movielens.movies p, UNNEST(genres) g\n",
    "      JOIN ML.WEIGHTS(MODEL movielens.recommender_16) w\n",
    "      ON processed_input = 'movieId' AND feature = CAST(p.movieId AS STRING)\n",
    "    )\n",
    "\n",
    "    SELECT\n",
    "        p.* EXCEPT(movieId),\n",
    "        u.* EXCEPT(userId),\n",
    "        rating \n",
    "    FROM productFeatures p, userFeatures u\n",
    "    JOIN movielens.ratings r\n",
    "    ON r.movieId = p.movieId AND r.userId = u.userId"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the rows of this table looks like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>g</th>\n",
       "      <th>product_factors</th>\n",
       "      <th>loyalty</th>\n",
       "      <th>postcode</th>\n",
       "      <th>user_factors</th>\n",
       "      <th>rating</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>Traffic (2000)</td>\n",
       "      <td>Crime</td>\n",
       "      <td>[0.050302306705910786, 0.026067456450144123, 0...</td>\n",
       "      <td>799.017058</td>\n",
       "      <td>29</td>\n",
       "      <td>[5.420790935241976, 2.2204505040712856, -4.530...</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            title      g                                    product_factors  \\\n",
       "0  Traffic (2000)  Crime  [0.050302306705910786, 0.026067456450144123, 0...   \n",
       "\n",
       "      loyalty postcode                                       user_factors  \\\n",
       "0  799.017058       29  [5.420790935241976, 2.2204505040712856, -4.530...   \n",
       "\n",
       "   rating  \n",
       "0     2.0  "
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "SELECT *\n",
    "FROM movielens.hybrid_dataset\n",
    "LIMIT 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Essentially, we have a couple of attributes about the movie, the product factors array corresponding to the movie, a couple of attributes about the user, and the user factors array corresponding to the user. These form the inputs to our “hybrid” recommendations model that builds off the matrix factorization model and adds in metadata about users and movies.\n",
    "### Training hybrid recommendation model\n",
    "At the time of writing, BigQuery ML can not handle arrays as inputs to a regression model. Let’s, therefore, define a function to convert arrays to a struct where the array elements are its fields:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "CREATE OR REPLACE FUNCTION movielens.arr_to_input_16_users(u ARRAY<FLOAT64>)\n",
    "RETURNS \n",
    "    STRUCT<\n",
    "        u1 FLOAT64,\n",
    "        u2 FLOAT64,\n",
    "        u3 FLOAT64,\n",
    "        u4 FLOAT64,\n",
    "        u5 FLOAT64,\n",
    "        u6 FLOAT64,\n",
    "        u7 FLOAT64,\n",
    "        u8 FLOAT64,\n",
    "        u9 FLOAT64,\n",
    "        u10 FLOAT64,\n",
    "        u11 FLOAT64,\n",
    "        u12 FLOAT64,\n",
    "        u13 FLOAT64,\n",
    "        u14 FLOAT64,\n",
    "        u15 FLOAT64,\n",
    "        u16 FLOAT64\n",
    "    > AS (STRUCT(\n",
    "        u[OFFSET(0)],\n",
    "        u[OFFSET(1)],\n",
    "        u[OFFSET(2)],\n",
    "        u[OFFSET(3)],\n",
    "        u[OFFSET(4)],\n",
    "        u[OFFSET(5)],\n",
    "        u[OFFSET(6)],\n",
    "        u[OFFSET(7)],\n",
    "        u[OFFSET(8)],\n",
    "        u[OFFSET(9)],\n",
    "        u[OFFSET(10)],\n",
    "        u[OFFSET(11)],\n",
    "        u[OFFSET(12)],\n",
    "        u[OFFSET(13)],\n",
    "        u[OFFSET(14)],\n",
    "        u[OFFSET(15)]\n",
    "));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "which gives:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>u1</th>\n",
       "      <th>u2</th>\n",
       "      <th>u3</th>\n",
       "      <th>u4</th>\n",
       "      <th>u5</th>\n",
       "      <th>u6</th>\n",
       "      <th>u7</th>\n",
       "      <th>u8</th>\n",
       "      <th>u9</th>\n",
       "      <th>u10</th>\n",
       "      <th>u11</th>\n",
       "      <th>u12</th>\n",
       "      <th>u13</th>\n",
       "      <th>u14</th>\n",
       "      <th>u15</th>\n",
       "      <th>u16</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>13.0</td>\n",
       "      <td>14.0</td>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    u1   u2   u3   u4   u5   u6   u7   u8   u9  u10   u11   u12   u13   u14  \\\n",
       "0  0.0  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0  10.0  11.0  12.0  13.0   \n",
       "\n",
       "    u15   u16  \n",
       "0  14.0  15.0  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "SELECT movielens.arr_to_input_16_users(u).*\n",
    "FROM (SELECT\n",
    "    [0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15.] AS u)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can create a similar function named movielens.arr_to_input_16_products to convert the product factor array into named columns.\n",
    "\n",
    "**TODO 2**: Create a function that returns named columns from a size 16 product factor array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "CREATE OR REPLACE FUNCTION movielens.arr_to_input_16_products(p ARRAY<FLOAT64>)\n",
    "RETURNS \n",
    "    STRUCT<\n",
    "        p1 FLOAT64,\n",
    "        p2 FLOAT64,\n",
    "        p3 FLOAT64,\n",
    "        p4 FLOAT64,\n",
    "        p5 FLOAT64,\n",
    "        p6 FLOAT64,\n",
    "        p7 FLOAT64,\n",
    "        p8 FLOAT64,\n",
    "        p9 FLOAT64,\n",
    "        p10 FLOAT64,\n",
    "        p11 FLOAT64,\n",
    "        p12 FLOAT64,\n",
    "        p13 FLOAT64,\n",
    "        p14 FLOAT64,\n",
    "        p15 FLOAT64,\n",
    "        p16 FLOAT64\n",
    "    > AS (STRUCT(\n",
    "        p[OFFSET(0)],\n",
    "        p[OFFSET(1)],\n",
    "        p[OFFSET(2)],\n",
    "        p[OFFSET(3)],\n",
    "        p[OFFSET(4)],\n",
    "        p[OFFSET(5)],\n",
    "        p[OFFSET(6)],\n",
    "        p[OFFSET(7)],\n",
    "        p[OFFSET(8)],\n",
    "        p[OFFSET(9)],\n",
    "        p[OFFSET(10)],\n",
    "        p[OFFSET(11)],\n",
    "        p[OFFSET(12)],\n",
    "        p[OFFSET(13)],\n",
    "        p[OFFSET(14)],\n",
    "        p[OFFSET(15)]\n",
    "));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we can tie together metadata about users and products with the user factors and product factors obtained from the matrix factorization approach to create a regression model to predict the rating:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**NOTE**: The below cell will take approximately 25~30 minutes for the completion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Query complete after 0.00s: 100%|██████████| 3/3 [00:00<00:00, 1655.43query/s]"
     ]
    }
   ],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "CREATE OR REPLACE MODEL movielens.recommender_hybrid \n",
    "OPTIONS(model_type='linear_reg', input_label_cols=['rating'])\n",
    "AS\n",
    "\n",
    "SELECT\n",
    "  * EXCEPT(user_factors, product_factors),\n",
    "    movielens.arr_to_input_16_users(user_factors).*,\n",
    "    movielens.arr_to_input_16_products(product_factors).*\n",
    "FROM\n",
    "  movielens.hybrid_dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is no point looking at the evaluation metrics of this model because the user information we used to create the training dataset was fake (not the RAND() in the creation of the loyalty column) -- we did this exercise in order to demonstrate how it could be done. And of course, we could train a dnn_regressor model and optimize the hyperparameters if we want a more sophisticated model. But if we are going to go that far, it might be better to consider using [Auto ML tables](https://cloud.google.com/automl-tables).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2022 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
